A combined monthly precipitation prediction method based on CEEMD and improved LSTM

With the continuous decline of water resources due to population growth and rapid economic development, precipitation prediction plays an important role in the rational allocation of global water resources. To address the non-linearity and non-stationarity of monthly precipitation, a combined prediction method based on complementary ensemble empirical mode decomposition (CEEMD) and a modified long short-term memory (LSTM) neural network was proposed. Firstly, the CEEMD method was used to decompose the monthly precipitation series into a set of relatively stationary sub-sequence components, which can better reflect the local characteristics of the sequence and further understand the nonlinear dynamic characteristics of the sequence. Then, improved LSTM neural networks were employed to predict each sub-sequence. The proposed improvement method optimized the hyper-parameters of LSTM neural networks using particle swarm optimization algorithm, which avoided the randomness of artificial parameter selection. Finally, the predicted results of each component were superimposed to obtain the final prediction result. The proposed method was validated by taking the monthly precipitation data from 1961 to 2020 in Changde City, Hunan Province as an example. The results of the case study show that, compared with other traditional prediction methods, the proposed method can better reflect the trend of precipitation changes and has higher prediction accuracy.


Introduction
The World Meteorological Organization (WMO) published the report "State of Climate Services 2021: Water" in October 2021, stating that climate change, particularly the increased frequency of extreme heat, will cause a global water crisis and that governments and relevant international organizations are still lacking effective response measures [1]. Global heat, drought, and various types of extreme weather have resulted in a drop in global precipitation, yet with the rapid expansion of society, the usage of freshwater resources has increased more than 35 times in the previous 300 years, resulting in a water crisis in many parts of the world. In response, China's Ministry of Water Resources issued its 2022 Water Resources Management Essentials, which suggests increasing the level of refinement in water resource management. Precipitation forecasting is a key foundation for enhanced water resource management and an important instrument for extreme weather warnings. This research proposes a combined monthly precipitation prediction model based on CEEMD-PSO-LSTM to increase the accuracy of monthly precipitation prediction and can provide a decision basis for energy warning and meteorological catastrophe prevention. Numerous techniques for predicting precipitation have long been studied by researchers both domestically and overseas. The primary prediction techniques can be categorised into three groups: machine learning methods, statistical prediction methods, and causal prediction methods. The basic foundation of causal prediction methods is the study of climatic changes, atmospheric circulation processes, and the creation of causal models [2][3][4]. However, because there are so many variables that might affect atmospheric precipitation, choosing the right one makes it challenging to predict precipitation using the genesis model. To build prediction models based on historical precipitation data, statistical prediction methods use time-tested techniques like Markov chains [5] and autoregressive integrated moving average (ARIMA) [6]. The non-smooth and non-linear properties of precipitation data are frequently ignored by statistical prediction methods, which leads to typically subpar prediction accuracy. Due to the complexity and randomness of mid-long term precipitation, machine learning methods mostly use neural network algorithms to develop forecasting models, such as LSTM [7][8][9]. This can result in reduced convergence speed and forecasting accuracy of neural networks. In summary, single machine learning algorithms, statistical prediction models, and genetic prediction methods all have limitations and do not provide accurate prediction results for mid-long term precipitation. Both domestic and foreign scholars have made certain improvements to single neural network algorithms, such as optimizing the hyper-parameters of neural networks using optimization algorithms [10,11], which avoids the randomness of manual selection. For nonlinear data, decomposition methods [9,12,13] can be used to decompose sequences into multiple IMFs in order to better capture the nonlinear components in the signal and improve prediction accuracy.
Empirical model decomposition (EMD) [14] is a decomposition model proposed by Huang N. E. in 1988 that has strong advantages for processing non-linear and non-stationary data. With the gradual deepening of EMD research, it is discovered that the process of adaptive decomposition in EMD inevitably involves modal confusion. Huang [15] et al. suggested the ensemble empirical model decomposition (EEMD) method, which successfully lessens the influence of the mode mixing phenomenon on the results while maintaining the physical uniqueness of the derived IMF components. In summary, the monthly precipitation data was decomposed by using the CEEMD method, which is an improvement over EEMD and effectively reduces the reconstructed signal noise so that the residual white noise is negligible;For nonlinear and non-stationary data, the signal characteristics are usually complex and dynamic, making it difficult to effectively process and extract them using traditional linear analysis methods such as Fourier analysis and wavelet analysis. However, the CEEMD algorithm based on EMD can achieve adaptive decomposition of non-linear and non-stationary signals and decompose them into multiple IMFs, thereby better capturing the non-stationary and non-linear components contained in the signal. LSTM neural networks are widely valued due to their superiority in capturing Spatio-temporal relationships [16]. In time series data, current data is often influenced by data from multiple previous time steps. Traditional feedforward neural networks cannot meet this requirement, resulting in errors in prediction. However, the design of long-and short-term memory units in LSTM networks allows for effective learning and memory of information from previous time steps, thereby better handling long-term dependencies and improving prediction performance. LSTM networks were used to predict the decomposed IMF components; to avoid the contingency of traditional manual selection of LSTM hyperparameters, the particle swarm algorithm [17][18][19][20] was used to find the optimal hyperparameters in LSTM. Compared with other optimization algorithms, PSO (Particle Swarm Optimization) algorithm can quickly converge to the global optimal solution and usually has faster calculation speed than other algorithms. The PSO algorithm has good robustness and reliability, is less dependent on initial parameters, and is not easily trapped in local optimal solutions. A combined CEEMD-PSO-LSTM prediction model is established, which effectively improves the prediction accuracy on the basis of traditional precipitation prediction method.
To test the effectiveness of the model, Changde City, known as an international wetland city and an international garden city, was selected as the study area, and the model was applied to the monthly precipitation prediction of Changde City and compared with PSO-LSTM [21], EMD-LSTM [22] and EEMD-LSTM [23]. The analysis of the algorithm results shows that the combined CEEMD-PSO-LSTM prediction model has better prediction results and provides more effective decision support for water resources development planning for global development.

CEEMD
Huang N E [14] proposed EMD in 1988 as part of the Hilbert-Huang transform (HHT), which has the advantage of processing non-stationary signals. The core of EMD is to decompose the signal into several intrinsic model function (IMF) and monotonic residual by using the signal polar point information. The EEMD decomposition is a solution to the problem of modal aliasing by adding normally distributed white noise to the original order and then decomposing the white noise as a whole to obtain the corresponding IMF components. In the CEEMD decomposition process, multiple sets of white noise with opposite signs are added to the original signal, which effectively reduces the noise of the reconstructed signal and achieves the goal of negligible residual white noise. The CEEMD decomposition process is as follows: (1) Add L sets of white noise of opposite sign to the original sequence X(t) to obtain a positive noise sequence X 1 and a negative noise sequence X 2 , with the total number of sequences being 2L.
Where N is the white noise sequence.
(3) By taking the average of IMF þ ij ðtÞ and IMF À ij ðtÞ for each set of IMF components, determine the value of the j th component.
(4) Each set of IMF values obtained is accumulated to obtain the original sequence. The formula is as in Eq (3): Where r(t) is the residual trend term, i.e., the residual.

PSO
The main idea of the PSO algorithm is to use "particles" to simulate the behaviour of a "flock of birds foraging". When PSO solves an optimisation problem, each particle has its own position, velocity, and fitness value. The problem solution corresponds to the position of each particle in the search space, and PSO completes the optimisation by the particle following the optimal solution it finds and the optimal solution of the population. Assume that there is currently a D-dimensional search space with a population of N particles, the position of the i th particle is: ; the individual history optimal adaptation value is f p ; the population history optimal adaptation value is f g ; the individual optimal solution is P id,pbest = (p i1 , p i2 , � � �, p iD ); the population optimal solution is P d,gbest = (p 1,gbest , p 2,gbest , � � �, p d,gbest ). The velocity update equation is given by equation: where i is the particle serial number; k is the number of iterations; d is the particle dimension serial number; ω is inertia weight; c 1 and c 2 are the individual learning factor and population learning factor respectively; r 1 and r 2 are random numbers between 0 and 1; v k id and x k id are the velocity vector and position vector of particle i in dimension d at the k th iteration respectively; p k id; phest is the historical optimal position of particle i in dimension d at the k th iteration; p k d; gbest is the historical optimal position of the population in dimension d at the k th iteration; the first term to the right of the equal sign is the inertia term. The larger the ω the greater the inertial exploration capacity of the particle; the 2nd term to the right of the equal sign is the cognitive term, i.e. the distance and direction between the current position of a particle and its historically optimal position; the 3rd term to the right of the equal sign is the social term, i.e.the distance and direction between the current position of the particle and the optimal position of the group history. The position update equation is as follows: If the current number of iterations reaches the pre-set maximum number or the minimum difference in the adaptation value between two iterations, stop the iteration and output the optimal solution, otherwise the particle velocity and position are updated using equations, and.
The MSE of LSTM network prediction was used as the particle fitness value: Where y i is the true value;ŷ i is the predicted value; n is the number of samples.

LSTM
The LSTM network is a special kind of recurrent neural network whose memory module structure retains the recurrent feedback mechanism while introducing gating units to control the rate of information accumulation and adding forgetting gates to selectively control the addition of new information, thus solving the long-term dependency problem that exists in sequence modelling. The LSTM neural network adds input, output and forget gate structures to the RNN. The architecture of LSTM block is shown in Figs 1 and 2, which mainly consists of a memory unit, an oblivion gate, an input gate and an output gate, all of which take values in the range of (0, 1) and are controlled by a sigmoid function. The memory unit is the key component in the LSTM module, and its state at moment t is c t , which contains information about the long-term memory of the series. Assuming moment t, the inputs to the memory module in the LSTM include the input sequence X t , state c t−1 of memory cell at time t − 1 and state h t of hidden layer at time t − 1. In the memory module, the forgetting gate controls how much of the value of c t−1 is forgotten at the previous moment and controls the degree of influence of c t−1 on c t ; in the input gate, the linear combination of the state of the hidden layer at the previous moment h t−1 and the input sequence X t are used as the input to the sigmoid function.c t is the information retained after input gate control, i.e. the extent to which control X t affects c t ; in the output gate, the linear combination of the previous implicit state h t−1 and the input sequence X t is used as input to the sigmoid function, which determines the output information to be retained by the memory neuron, i.e. the degree of influence of c t on h t . Equations-for each of the three gates are as follows: where f t , i t , and o t are the output results of the forget, input, and output gates at time t;W f , W i , and W o are the weight matrices of the forget, input, and output gates; b f , b i , and b o are the bias terms of the forget, input, and output gates respectively; σ is the sigmoid activation function.
The memory module c t is obtained by adding and multiplying c t−1 with the addition of the informationc t retained after being controlled by the input gate, calculated as follows: The state of the memory cell c t and the state of the hidden layer h t at time t are calculated as follows: In Formula (10), W c and b c denote the weight matrix and bias term of the input cell state; tanh denotes the hyperbolic tangent activation function.

Monthly precipitation prediction model based on CEEMD-PSO-LSTM
The predictability of monthly precipitation data using traditional prediction techniques is poor because they are inherently variable, non-linear, and non-smooth. This paper selects the CEEMD decomposition model and LSTM network prediction model, selects the PSO algorithm to optimise the hyperparameters in the LSTM, and proposes a combined CEEMD-P-SO-LSTM prediction model in light of the benefits of empirical modal decomposition in series smoothing and the excellent performance of long and short-term memory neural networks in time series data prediction. The data decomposition, PSO optimization, and LSTM network prediction are the three key stages that make up the combined prediction model. Fig 3 illustrates the specific steps in this procedure.
(1) CEEMD decomposition phase. The CEEMD model is used to decompose the monthly precipitation data into the L-group IMF components {IMF 1 , IMF 2 , � � �, IMF L } and the residual error RES.
(2) PSO optimization phase. The PSO-LSTM model is used to make LSTM predictions for each set of IMF components separately, and the hyperparameters of the LSTM network are optimized by PSO. The specific steps are as follows: ① Initialize the particle swarm parameters. Determine the population size, number of layers, number of iterations, individual and population learning factors, limited range of particle position and velocity values, and inertia weights. ② Randomly initialise the particle velocity and position in the bounded range. Randomly generate a particle, K is the number of iterations of the LSTM, lr is the learning rate, H 1 is the number of neurons in layer 1 hidden layer, and H 2 is the number of neurons in layer 2 hidden layer.
③ Determine the fitness function of the PSO algorithm. The LSTM model is constructed with the initialized parameters, and the mean square error between the true value and the predicted value is used as the fitness function of the particle population, as shown in Fig 3. ④ Calculate the position and fitness value of the particles at each iteration. The positions and velocities of the particles are continuously adjusted according to equations until the fitness function is minimized to determine the optimal positions and thus the optimal parameters of the LSTM.

PLOS ONE
(3) LSTM prediction phase. The optimal parameters determined after the PSO search are used to predict each set of IMF components obtained from the decomposition, and then the predicted values of each set of IMFs and the predicted values of the residual term RES are added together to obtain the final prediction results.
In this paper, RMSE and MAE are selected as evaluation indexes, and the specific formulas are as follows: RMSE ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P N r¼1 ðx r Àx r Þ 2 N r ð13Þ In the formula: x r is the true value;x r is the predicted value; N is the number of samples.

Experimental results
The experimental environment is Intel(R) Core (TM) i7-10510U, 2.30GHz processor, NVI-DIA GeForce MX250 graphics card. Algorithm model uses MATLAB R2022a as programming language.

Overview of the study area
Changde is in the south of mainland China, in the northwest of Hunan Province, and is famous as the "land of fish and rice" in the south of the Yangtze River. It is located in the Dongting Lake system in the middle reaches of the Yangtze River, the lower reaches of the Yuan River, and the middle and lower reaches of the river, as well as the northeastern end of the Wuling Mountains and the Xuefeng Mountains. Changde is 174.6 kilometres wide from east to west and 187.2 kilometres long from north to south, with a total area of 18,200 square kilometres. With an average annual temperature of 16.7˚C and 1200-1900 mm of precipitation, Changde has a subtropical humid monsoon climate. Water resources in Changde are relatively abundant, with a total of 15.337 billion cubic meters of water resources on average over the years, with a per capita possession of 2,556 cubic meters. Changde has abundant rainfall and water resources mainly come from precipitation, which is unevenly distributed in space and time, with precipitation and runoff accounting for more than 70% of the year during the period of abundant water (April to October). Changde is one of the first international wetland cities and international garden cities in China, so it is important to carry out medium and long-term precipitation forecasting work in the area.

Research data sources and pre-processing
The monthly precipitation data from seven representative meteorological stations in Changde, Hunan Province, from January to December 1961 to 2020 were selected, and the distribution of meteorological stations is shown in Fig 4. In view of the limited space of the article, the validity and accuracy of the combined prediction model are mainly verified by using the measured data from Changde station with station number 57662.
The data is from Changde Meteorological Bureau. The data is true and accurate, part of the data is shown in Table 1. The number of data samples is 720.
Data for August to October 1976 were missing and Three times Hermite interpolation [24] were performed on the data for that year to maintain data continuity and reduce data loss. The final results obtained are shown in Table 2.

CEEMD decomposition of monthly precipitation time series
The monthly precipitation time series has obvious non-linearity and non-smoothness, and CEEMD was used to decompose the series. When decomposed, add a white noise amplitude of 0.02 times the standard deviation of the original signal, set the average number of processing to 50. The original series was decomposed into eight IMF components [25], and the effect is shown in Fig 5, where each IMF presents the influence of different influencing factors on precipitation at different scales. Compared with the EMD IMFs, the CEEMD-processed IMFs do not show the mode mixing that often occurs in EMD, and each IMF contains significantly different characteristic time scales. After CEEMD processing, it can be seen that the auxiliary

PLOS ONE
noise residuals of the IMF components are decreased and the signal-to-noise ratio is increased compared to the EEMD IMFs. This allows the information of the original series to be more accurately reflected, and the total number of set averaging, the decomposition takes less time.

PSO-LSTM network prediction
LSTM prediction was performed on each of the eight IMF and RES sequences obtained after CEEMD decomposition. Before making the predictions, each sequence data was first normalised separately with the following equation.
where x i is the original data; x max , x min are the maximum and minimum values of the original data respectively; and X i is the normalised data. The timestep of the input samples for the LSTM network is 12, with 12 consecutive months of data as the input variable and the next month's data as the output variable. The data set is divided as shown in Table 3, and the network uses a mini-batch input, with the number of samples per input, batch-size = 16.

PLOS ONE
The LSTM network architecture adopts a 2 + 1 stack structure (2 layers of LSTM and 1 layer of fully connected layer). In order to prevent the neural network from overfitting, Dropout technology is added to each layer [26] (parameter value is 0.2). Dropout technology is to randomly discard some neurons in the neural network model, the weights of the discarded

PLOS ONE
neurons are set to zero, and discard neurons do not participate in network training forward calculation and reverse calculation, reducing the weight parameters, reduce the overfitting phenomenon, Fig 6 is a schematic diagram of Dropout technology. The neural network training uses the full iterative epoch method (the sample set is recorded as 1 epoch after training), the loss function uses the root mean square error, and the gradient optimization algorithm uses Adam [27]. After dividing the training set, the training set data is input into LSTM, and a loss value is generated by forward propagation calculation. According to the loss value, the Adam optimizer uses the BPTT algorithm to adjust the weight of LSTM. By comparing the fitness function, the PSO algorithm is used to find the optimal number of two hidden layer units H 1 , H 2 , the number of training K, and the learning rate lr. The PSO partial parameter settings are shown in Table 4, M is the maximum iteration parameter. H 1 , H 2 in the range [1,200], K in the range [10,100], lr in the range [0.001, 0.01]. As the number of training iterations increases, the accuracy of LSTM predictions improves. The results of hyper-parameter optimization can be found in Table 5. After LSTM training, the test set data is input into LSTM, the denormalized results are compared with the actual results, and the error indicators are used to evaluate the prediction performance of LSTM.

Simulation comparison results analysis
The above sub-series components predicted by the LSTM model were overlaid and reconstruction was carried out to obtain the monthly precipitation predictions. In this section, the research idea of verifying the superiority of the CEEMD-PSO-LSTM prediction model is mainly divided into three steps, and the evaluation indexes are RMSE and MAE. In the first step, BP neural network [28], SVM [29] and ANN [30] are used to compare the prediction accuracy of the three commonly used models with the LSTM model to prove the advantages of LSTM in dealing with time series modeling problems. In the second step, compare the EMD-LSTM, EEMD-LSTM and CEEMD-LSTM [31,32] combined prediction models to prove the superiority of using CEEMD to process non-stationary data. In the third step, PSO-LSTM, CEEMD-LSTM and CEEMD-PSO-LSTM [33] models are compared to prove that the CEEMD-PSO-LSTM combined model has the best prediction performance. The prediction results of the above prediction model in the test set samples are shown in Figs 7 and 8, and the predictors are shown in Table 6 the point where the monthly precipitation changes greatly, which proves that the LSTM model can capture the temporal correlation in the data more effectively. Compared with EMD-LSTM and EEMD-LSTM, RMSE and MAE of CEEMD-LSTM are lower than those of the other two models. As can be seen from Fig 8, although the trend of these three models is basically the same as the actual value, the CEEMD-LSTM predicted value is basically consistent with the actual value at the inflection point, which shows the superiority of the CEEMD decomposition method for processing non-stationary data. Finally, compared with PSO-LSTM and CEEMD-LSTM models, the RMSE of CEEMD-PSO-LSTM model is reduced by 25.26% and 6.43% respectively, and the MAE is reduced by 21.14% and 8.37% respectively. In Fig 8, where the sample points fluctuate greatly, the CEEMD-PSO-LSTM method has better prediction results, which proves that the CEEMD-PSO-LSTM method has better estimated performance. Through Figs 8 and 9, the CEEMD-PSO-LSTM method improves the overall forecasting precision while controlling the deviation of most prediction points from the actual data within a small range. This is due to the fact that the model decomposes the monthly precipitation data into a number of sub-series with significant regularity before forecasting them separately, improving the accuracy of the prediction.

Conclusion
This paper combines the current research hotspots in the field of deep learning and focuses on the prediction accuracy of monthly precipitation to conduct research and establish a CEEMD-PSO-LSTM prediction model. Initially, the CEEMD decomposition algorithm was used to decompose the monthly precipitation with non-linearity and non-stationary characteristics into sub-sequences. Then, PSO was used to optimize the hyper-parameters of the LSTM network. Finally, the LSTM model was used to predict each sub-sequence, and the predicted results were combined to obtain the final monthly precipitation prediction. The following conclusions were drawn: (1) The CEEMD decomposition method is used to decompose the monthly precipitation series, which reduces the interaction between different time scale information. Compared with EMD-LSTM and EEMD-LSTM methods, the RMSE of CEEMD-LSTM is reduced by

PLOS ONE
13.78% and 3.75% respectively, MAE is reduced by 6.21% and 0.9% respectively,and MAPE is reduced by 13.83% and 6.95% respectively. It shows that the CEEMD effectively improves prediction accuracy.
(2) The PSO algorithm is used to optimize the hyper-parameters of the LSTM network, which avoids the contingency of manual selection. Compared with other optimization algorithms, PSO algorithm can quickly converge to the global optimal solution and have good robustness. LSTM network has certain superiority in handling time series data. Through the design of LSTM units, the LSTM network can effectively learn and remember information from previous time steps, thereby better handling long-term dependencies and improving prediction performance.
(3) The combined CEEMD-PSO-LSTM model has been developed to effectively improve the accuracy of monthly precipitation prediction. The model is suitable for processing nonsmooth, non-linear time-series data and can also be extended to the fields of electricity, traffic flow and text recognition.

PLOS ONE
(4) In the next step, other influencing factors [34][35][36][37] can be introduced, such as pressure, temperature, etc., to further improve the reliability and prediction accuracy of the model.